Lab Notebook

Back to Part 6

Note: Interpreting BLAST search results

Let's consider how one might go about assigning a numerical value to the degree of similarity between two DNA sequences. Suppose we have two sequences as follows:
CGGCAT
CGCGAT
Let's assign 1 point for each position where the two bases match exactly and 0 points for each position where they do not. Reading from left to right, we have C-C (match), G-G (match), G-C (no match), C-G (no match), A-A (match), and T-T (match), for a total of 4 points. Under this hypothetical system, the more nucleotides that match, the higher the score.
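The point system above can be sketched in a few lines of Python. The function name match_score is made up for this illustration, and the sketch assumes both sequences have the same length and are compared position by position, with no gaps.

```python
def match_score(seq_a: str, seq_b: str) -> int:
    """Count the positions where two equal-length sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    # 1 point per matching position, 0 points otherwise
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


print(match_score("CGGCAT", "CGCGAT"))  # 4
```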

When comparing two DNA sequences, it's important to remember that because of evolutionary history, the sequences may have diverged not only by substitution of bases but also by deletions or insertions of bases. This means that the sequences being matched may not be exactly the same length, and the best alignment may contain gaps. For these two sequences, with a gap (written "-") scoring 0 points just like a mismatch, one of the best matches is
CGGC-AT
CG-CGAT
for a total of 5 points.
Another alignment with the same score is
CG-GCAT
CGCG-AT
for a total of 5 points.
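As a rough check on the two alignments above, the sketch below scores an alignment that has already been written out with gaps, and then searches for the best achievable score with a small dynamic-programming table. It assumes the toy scoring read off the 5-point totals (match = 1, mismatch = 0, gap = 0); BLAST itself penalizes gaps, so this is not how BLAST scores. The function names are made up for this example.

```python
def aligned_score(row_a: str, row_b: str, gap: str = "-") -> int:
    """Score a written-out alignment: 1 per identical base column, 0 otherwise."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)


def best_score(seq_a: str, seq_b: str) -> int:
    """Best score over all gapped alignments under the toy scoring (gaps cost 0)."""
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            curr.append(max(
                prev[j - 1] + (1 if a == b else 0),  # align a with b
                prev[j],                             # gap in seq_b
                curr[j - 1],                         # gap in seq_a
            ))
        prev = curr
    return prev[-1]


print(aligned_score("CGGC-AT", "CG-CGAT"))  # 5
print(aligned_score("CG-GCAT", "CGCG-AT"))  # 5
print(best_score("CGGCAT", "CGCGAT"))       # 5
```

Because the table only keeps the best score for each prefix pair, it confirms that no gapped alignment of these two sequences scores higher than 5 under this scoring, without listing every alignment.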

From the simple example above, you can imagine how rapidly sequence comparisons can become complicated as DNA length increases. The statistics for comparing two sequences of DNA are thus highly complicated. Here we cover just the bare essence of the topic so that you can interpret the response from your sequence query.

Let's suppose you do a BLAST search of the following sequence:
TATCGCGTATTGCC
BLAST will come back with a report that begins with the literature reference for the search program, followed by the number of letters in your sequence, the number of letters in the database, a graphic representation of the sequence matches, and a list of matches. The list of matches is sorted with the best-matching sequences shown first. For the sequence we used, the list starts with the following:

                                           Score    E
Sequences producing significant alignments:(bits) Value

gb|AC012156.14|AC012  Homo sapiens chr 12..  28    5.8
ref|NC_001142.1 Saccharomyces cerevisiae...  28    5.8

What does this mean? "Score" is a numerical score assigned by BLAST. In the simple example we used earlier, we assigned 1 point for each match and 0 points for each non-match. In BLAST, the scoring system uses "bits" as the measure of information. For DNA, each position can be occupied by T, A, C, or G. Each match therefore carries 2 bits of information (one correct base out of four possibilities). For a 14-nucleotide-long sequence like ours, the maximum match score is then 28 bits. The higher the score, the better the match.
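The 2-bits-per-match figure is just the base-2 logarithm of the number of possible bases. The sketch below reproduces the 28-bit maximum for the example query under that simplified view; the function name is made up here, and the bit scores BLAST actually reports are derived from its own scoring parameters, so treat this as an illustration of the arithmetic only.

```python
import math


def max_bit_score(sequence: str, alphabet_size: int = 4) -> float:
    """Upper bound on the score if every position matches (simplified view)."""
    bits_per_match = math.log2(alphabet_size)  # log2(4) = 2 bits for DNA
    return len(sequence) * bits_per_match


print(max_bit_score("TATCGCGTATTGCC"))  # 28.0
```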

"E-value" is the number of hits one can expect to see just by chance when searching a database of a particular size. The value is defined as
E = N/n * m * n * 2^-S
where m and n are the lengths of the two nucleotide sequences (measured in base pairs), S is the bit score, and N is the total length of all sequences in the database. The formula should make intuitive sense: m * n * 2^-S is the number of chance hits expected when comparing just the two sequences, and the factor N/n scales that number up to the whole database. For example, if S is higher (i.e., a better match), you would expect to see fewer "hits." On the other hand, if m or n is larger (i.e., one or the other sequence is longer), the pairwise term grows and you would expect to see more hits purely by chance. Finally, if the database contains more sequences (i.e., N is larger), then you would expect to see more hits. In any case, if BLAST returns an E-value that is very small or close to zero, then you probably have a meaningful match that is not due to random chance.
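To see how the terms in the formula pull against each other, here is a minimal sketch that evaluates it exactly as written. Only m = 14 and S = 28 come from the example query; the values of n and N are hypothetical numbers picked for illustration, not the size of any real database.

```python
def e_value(m: int, n: int, N: float, S: float) -> float:
    """E = N/n * m * n * 2^-S, evaluated term by term as in the text."""
    pairwise = m * n * 2.0 ** (-S)  # the m * n * 2^-S part of the formula
    return (N / n) * pairwise       # multiplied by the N/n factor


# Hypothetical inputs: n = 1000 and N = 1e8 are assumptions for this example.
base = e_value(m=14, n=1_000, N=1e8, S=28)
better_score = e_value(m=14, n=1_000, N=1e8, S=29)  # one extra bit of score
bigger_db = e_value(m=14, n=1_000, N=2e8, S=28)     # database twice as large

print(base, better_score, bigger_db)
```

Each additional bit of score halves E, while doubling the database size doubles it.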

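To interpret the matches, you therefore need to pay attention to whether the E-value is reasonably small. The E-value is related to the P-value, the probability of finding at least one such hit by chance, by the following formula:
P = 1 - e^-E
For small values, P and E are nearly equal: a P-value of 0.05 (the conventional significance level) corresponds to an E-value of about 0.05. An E-value of 3, by contrast, corresponds to a P-value of about 0.95, meaning that at least one such hit is almost certain to turn up by chance alone. Thus, in your search, look for E-values of about 0.05 or less; the two hits listed above, with E-values of 5.8, are what one would expect from chance alone.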
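The conversion between E and P is easy to try out numerically. The sketch below applies P = 1 - e^-E in both directions; the function names are made up for this example, and the E-values fed in are just sample points, including the 5.8 reported in the hit list above.

```python
import math


def p_from_e(e: float) -> float:
    """P = 1 - e^-E; expm1 keeps precision when E is very small."""
    return -math.expm1(-e)


def e_from_p(p: float) -> float:
    """Inverse relation: E = -ln(1 - P)."""
    return -math.log1p(-p)


for e in (0.05, 3.0, 5.8):
    print(f"E = {e:<4} ->  P = {p_from_e(e):.3f}")
# E = 0.05 ->  P = 0.049
# E = 3.0  ->  P = 0.950
# E = 5.8  ->  P = 0.997
```

Note that for small E the two numbers are almost identical, while for large E the P-value saturates near 1.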

You should also keep in mind that there are a lot of sequences in the database and that some of them are from the same species and therefore might be very similar. In some cases, the name of the organism may have changed after it was originally reported; accordingly, two or more sequences may match extremely well but appear to belong to completely different species.

Consult the BLAST tutorial page for references and descriptions of the statistics used in BLAST (Internet connection required).
